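As a fundamental task for intelligent robots, visual SLAM has made great progress over the past decades. However, robust SLAM in highly weak-textured environments remains very challenging. In this paper, we propose a novel visual SLAM system named RWT-SLAM to tackle this problem. We modify the LoFTR network, which can produce dense point matches in low-textured scenes, to generate feature descriptors. To integrate the new features into the popular ORB-SLAM framework, we develop feature masks to filter out unreliable features and employ a KNN strategy to strengthen matching robustness. We also retrain the visual vocabulary on the new descriptors for efficient loop closing. The resulting RWT-SLAM is tested on various public datasets such as TUM and OpenLORIS, as well as on our own data. The results show very promising performance in highly weak-textured environments.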
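The combination of feature masks and KNN matching can be illustrated with a minimal sketch. It assumes descriptors are plain numeric tuples, that the mask is a per-feature boolean, and that "KNN strategy" means a 2-nearest-neighbour ratio test; the function name `knn_match`, the `ratio` parameter, and the toy data are hypothetical and are not taken from RWT-SLAM.

```python
import math


def knn_match(desc_a, desc_b, mask_a, ratio=0.8):
    """Match descriptors in desc_a to desc_b with a 2-nearest-neighbour ratio test.

    mask_a[i] is False for features judged unreliable; those are skipped.
    Returns a list of (index_in_a, index_in_b) pairs.
    """
    matches = []
    for i, (da, keep) in enumerate(zip(desc_a, mask_a)):
        if not keep:
            continue  # the feature mask filters this descriptor out
        # Distances from descriptor i to every descriptor in desc_b, ascending.
        dists = sorted((math.dist(da, db), j) for j, db in enumerate(desc_b))
        if len(dists) < 2:
            continue  # the ratio test needs two neighbours
        (d1, j1), (d2, _) = dists[0], dists[1]
        # Accept only if the best neighbour is clearly better than the runner-up.
        if d1 < ratio * d2:
            matches.append((i, j1))
    return matches


# Toy 2-D descriptors (assumed data, for illustration only).
desc_a = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
desc_b = [(0.1, 0.0), (5.0, 5.1), (1.0, 0.9), (9.0, 9.0)]
mask_a = [True, False, True]  # the second feature is treated as unreliable
matches = knn_match(desc_a, desc_b, mask_a)
```

Rejecting ambiguous matches matters most in weak-textured scenes, where many descriptors look alike and the nearest neighbour alone is often wrong.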
Inspired by the impressive success of contrastive learning (CL), a variety of graph augmentation strategies have been employed to learn node representations in a self-supervised manner. Existing methods construct the contrastive samples by adding perturbations to the graph structure or node attributes. Although they achieve impressive results, they are rather blind to the wealth of prior information available: as the degree of perturbation applied to the original graph increases, 1) the similarity between the original graph and the generated augmented graph gradually decreases; 2) the discrimination between all nodes within each augmented view gradually increases. In this paper, we argue that both kinds of prior information can be incorporated (differently) into the contrastive learning paradigm following our general ranking framework. In particular, we first interpret CL as a special case of learning to rank (L2R), which inspires us to leverage the ranking order among positive augmented views. Meanwhile, we introduce a self-ranking paradigm to ensure that the discriminative information among different nodes is maintained and is less sensitive to perturbations of different degrees. Experimental results on various benchmark datasets verify the effectiveness of our algorithm compared with supervised and unsupervised models.
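The first prior (similarity to the original falls as perturbation grows) can be turned into a ranking objective. The sketch below is an illustration under stated assumptions, not the paper's loss: it uses cosine similarity and a pairwise hinge with a hypothetical `margin`, and the name `view_ranking_loss` and the toy embeddings are invented for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def view_ranking_loss(anchor, views, margin=0.05):
    """Pairwise hinge loss over augmented views.

    `views` must be ordered from weakest to strongest perturbation, so an
    earlier view is expected to be more similar to the anchor than a later one.
    """
    sims = [cosine(anchor, v) for v in views]
    loss = 0.0
    for i in range(len(sims)):
        for j in range(i + 1, len(sims)):
            # Penalise pairs where the less perturbed view is not more
            # similar to the anchor by at least `margin`.
            loss += max(0.0, margin - (sims[i] - sims[j]))
    return loss


# Toy embeddings (assumed data): each view drifts further from the anchor.
anchor = (1.0, 0.0)
views = [(1.0, 0.1), (1.0, 0.5), (1.0, 2.0)]
ordered_loss = view_ranking_loss(anchor, views)
shuffled_loss = view_ranking_loss(anchor, views[::-1])
```

The loss is zero when the similarities already respect the perturbation order and positive when that order is violated, which is the learning-to-rank reading of CL described above.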
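Minimizing prediction uncertainty on unlabeled data is a key factor in achieving good performance in semi-supervised learning (SSL). Prediction uncertainty is typically expressed as the entropy computed from the transformed probabilities in the output space. Most existing works distill low-entropy predictions either by accepting the determining class (the one with the largest probability) as the true label or by suppressing subtle predictions (those with smaller probabilities). Either way, these distillation strategies are usually heuristic and less informative for model training. Motivated by this observation, this paper proposes a dual mechanism named ADaptive Sharpening (ADS), which first applies a soft threshold to adaptively mask out determinate and negligible predictions, and then seamlessly sharpens the informed predictions, distilling certain predictions from the informed ones only. More importantly, we theoretically analyze the traits of ADS by comparing it with various distillation strategies. Numerous experiments verify that ADS significantly improves state-of-the-art SSL methods when used as a plug-in. Our proposed ADS forges a cornerstone for future distillation-based SSL research.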
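A rough sketch of the mask-then-sharpen idea follows. It is not the ADS formulation: it swaps in a fixed hard threshold that removes only negligible classes, followed by the standard temperature sharpening used in MixMatch-style methods. The function name `mask_and_sharpen` and the `threshold` and `temperature` values are assumptions chosen for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mask_and_sharpen(probs, threshold=0.05, temperature=0.5):
    """Zero out negligible classes, then sharpen what remains.

    Sharpening raises each kept probability to the power 1 / temperature
    and renormalises; a temperature below 1 lowers the entropy.
    """
    kept = [p if p >= threshold else 0.0 for p in probs]
    powered = [p ** (1.0 / temperature) for p in kept]
    total = sum(powered)
    if total == 0.0:
        return list(probs)  # every class was masked; leave the input unchanged
    return [p / total for p in powered]


# Toy softmax output over four classes (assumed data).
probs = [0.5, 0.3, 0.17, 0.03]
sharp = mask_and_sharpen(probs)
```

On this input the last class is masked to zero, the leading class gains probability mass, and the entropy of `sharp` is lower than that of `probs`, which is the behaviour low-entropy distillation aims for.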
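Relative attributes (RAs), which express a preference between two images with respect to the strength of a specific attribute, can enable fine-grained image-to-image translation thanks to their rich semantic information. However, existing work based on RAs has failed to reconcile the goal of fine-grained translation with the goal of high-quality generation. We propose a new model, TRIP, to coordinate these two goals for high-quality fine-grained translation. In particular, we simultaneously train two modules: a generator that translates an input image into the desired image with smooth, subtle changes with respect to the attributes of interest; and a ranker that ranks rival preferences consisting of the input image and the desired image. Rival preferences refer to the adversarial ranking process: (1) the ranker judges that there is no difference between the desired image and the input image in terms of the desired attributes; (2) the generator fools the ranker into believing that the desired image changes the attributes of the input image as required. RAs over pairs of real images are introduced to guide the ranker to rank image pairs with respect to the attributes of interest only. With an effective ranker, the generator "wins" the adversarial game by producing high-quality images that present the desired changes in the attributes compared to the input image. Experiments on two face image datasets and one shoe image dataset demonstrate that TRIP achieves state-of-the-art results in generating high-fidelity images that exhibit smooth changes in the attributes of interest.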
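This paper proposes the Differential-Critic Generative Adversarial Network (DiCGAN) to learn the distribution of user-desired data when only part of the dataset, rather than all of it, possesses the desired property. DiCGAN generates desired data that meet the user's expectations and can assist in designing biological products with desired properties. Existing approaches first select the desired samples and then train regular GANs on the selected samples to derive the user-desired data distribution. However, selecting the desired data relies on global knowledge and supervision over the entire dataset. DiCGAN introduces a differential critic that learns from pairwise preferences, which are local knowledge and can be defined on a part of the training data. The critic is built by defining an additional ranking loss over the Wasserstein GAN's critic. It endows the difference in critic values between each pair of samples with the user preference and guides the generation of the desired data instead of the whole data. For a more efficient solution that ensures data quality, we further reformulate DiCGAN as a constrained optimization problem, based on which we theoretically prove the convergence of DiCGAN. Extensive experiments on a diverse set of datasets with various applications demonstrate that DiCGAN achieves state-of-the-art performance in learning user-desired data distributions, especially in cases of insufficient desired data and limited supervision.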
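The idea of adding a ranking loss on critic-value differences to a Wasserstein critic can be sketched as follows. The exact form of the ranking term is not given here, so the sketch assumes a pairwise logistic loss and a hypothetical `weight` hyperparameter; all function names and the toy scores are illustrative, and critic outputs are represented as plain floats.

```python
import math


def wgan_critic_loss(real_scores, fake_scores):
    """Wasserstein critic loss to minimise: E[f(fake)] - E[f(real)]."""
    mean_fake = sum(fake_scores) / len(fake_scores)
    mean_real = sum(real_scores) / len(real_scores)
    return mean_fake - mean_real


def pairwise_ranking_loss(preferred_scores, other_scores):
    """Logistic loss on critic-value differences (assumed form).

    Each pair holds the critic value of the sample the user prefers and
    the critic value of the sample it was compared against.
    """
    losses = [
        math.log1p(math.exp(-(p - o)))
        for p, o in zip(preferred_scores, other_scores)
    ]
    return sum(losses) / len(losses)


def differential_critic_loss(real, fake, preferred, other, weight=1.0):
    """WGAN critic loss plus a weighted pairwise ranking term."""
    return wgan_critic_loss(real, fake) + weight * pairwise_ranking_loss(preferred, other)


# Toy critic outputs (assumed data).
agree = pairwise_ranking_loss([2.0], [0.0])      # critic agrees with the preference
disagree = pairwise_ranking_loss([0.0], [2.0])   # critic contradicts the preference
total = differential_critic_loss([1.0, 3.0], [0.0, 1.0], [2.0], [0.0])
```

Because the ranking term only needs pairs of samples, it can be computed from local preferences on a subset of the training data, which is the property the abstract contrasts with selecting desired samples over the whole dataset.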
In this paper, we propose a robust 3D detector, named Cross Modal Transformer (CMT), for end-to-end 3D multi-modal detection. Without explicit view transformation, CMT takes image and point cloud tokens as inputs and directly outputs accurate 3D bounding boxes. The spatial alignment of multi-modal tokens is performed implicitly, by encoding the 3D points into multi-modal features. The core design of CMT is quite simple, yet its performance is impressive: CMT obtains 73.0% NDS on the nuScenes benchmark. Moreover, CMT remains robust even when the LiDAR input is missing. Code will be released at https://github.com/junjie18/CMT.
Different people speak with diverse personalized speaking styles. Although existing one-shot talking head methods have made significant progress in lip sync, natural facial expressions, and stable head motions, they still cannot generate diverse speaking styles in the final talking head videos. To tackle this problem, we propose a one-shot style-controllable talking face generation framework. In a nutshell, we aim to attain a speaking style from an arbitrary reference speaking video and then drive the one-shot portrait to speak with the reference speaking style and another piece of audio. Specifically, we first develop a style encoder to extract dynamic facial motion patterns of a style reference video and then encode them into a style code. Afterward, we introduce a style-controllable decoder to synthesize stylized facial animations from the speech content and style code. In order to integrate the reference speaking style into generated videos, we design a style-aware adaptive transformer, which enables the encoded style code to adjust the weights of the feed-forward layers accordingly. Thanks to the style-aware adaptation mechanism, the reference speaking style can be better embedded into synthesized videos during decoding. Extensive experiments demonstrate that our method is capable of generating talking head videos with diverse speaking styles from only one portrait image and an audio clip while achieving authentic visual effects. Project Page: https://github.com/FuxiVirtualHuman/styletalk.
The visual dimension of cities has been a fundamental subject in urban studies, since the pioneering work of scholars such as Sitte, Lynch, Arnheim, and Jacobs. Several decades later, big data and artificial intelligence (AI) are revolutionizing how people move, sense, and interact with cities. This paper reviews the literature on the appearance and function of cities to illustrate how visual information has been used to understand them. A conceptual framework, Urban Visual Intelligence, is introduced to systematically elaborate on how new image data sources and AI techniques are reshaping the way researchers perceive and measure cities, enabling the study of the physical environment and its interactions with socioeconomic environments at various scales. The paper argues that these new approaches enable researchers to revisit the classic urban theories and themes, and potentially help cities create environments that are more in line with human behaviors and aspirations in the digital age.
Deploying reliable deep learning techniques in interdisciplinary applications requires learned models to output accurate and (even more importantly) explainable predictions. Existing approaches typically explicate network outputs in a post-hoc fashion, under an implicit assumption that faithful explanations come from accurate predictions/classifications. We make the opposite claim: explanations boost (or even determine) classification. That is, end-to-end learning of explanation factors to augment discriminative representation extraction could be a more intuitive strategy to inversely assure fine-grained explainability, e.g., in neuroimaging and neuroscience studies with high-dimensional data containing noisy, redundant, and task-irrelevant information. In this paper, we propose such an explainable geometric deep network, dubbed NeuroExplainer, with applications to uncovering altered infant cortical development patterns associated with preterm birth. Given fundamental cortical attributes as network input, NeuroExplainer adopts a hierarchical attention-decoding framework to learn fine-grained attentions and the respective discriminative representations, so as to accurately distinguish preterm infants from term-born infants at term-equivalent age. NeuroExplainer learns the hierarchical attention-decoding modules under subject-level weak supervision coupled with targeted regularizers deduced from domain knowledge regarding brain development. These prior-guided constraints implicitly maximize the explainability metrics (i.e., fidelity, sparsity, and stability) in network training, driving the learned network to output detailed explanations and accurate classifications. Experimental results on the public dHCP benchmark suggest that NeuroExplainer produces quantitatively reliable explanation results that are qualitatively consistent with representative neuroimaging studies.
Domain adaptive detection aims to improve the generalization of detectors on the target domain. To reduce the discrepancy in feature distributions between two domains, recent approaches achieve domain adaptation through feature alignment at different granularities via adversarial learning. However, they neglect the relationship between multiple granularities and different features in alignment, which degrades detection. To address this, we introduce a unified multi-granularity alignment (MGA)-based detection framework for domain-invariant feature learning. The key is to encode the dependencies across different granularities, including pixel-, instance-, and category-levels, simultaneously to align two domains. Specifically, based on pixel-level features, we first develop an omni-scale gated fusion (OSGF) module to aggregate discriminative representations of instances with scale-aware convolutions, leading to robust multi-scale detection. In addition, we introduce multi-granularity discriminators to identify which domain, source or target, samples of different granularities come from. Note that MGA not only leverages instance discriminability in different categories but also exploits category consistency between two domains for detection. Furthermore, we present an adaptive exponential moving average (AEMA) strategy that uses model assessments to guide model updates, improving pseudo labels and alleviating the local misalignment problem, which boosts detection robustness. Extensive experiments on multiple domain adaptation scenarios validate the superiority of MGA over other approaches on FCOS and Faster R-CNN detectors. Code will be released at https://github.com/tiankongzhang/MGA.
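For readers unfamiliar with exponential moving average updates, a minimal sketch is given below. The plain EMA update is standard; how AEMA derives its decay from model assessments is not stated here, so `adaptive_decay` is a purely hypothetical rule (slow the update when the student is assessed worse than the teacher), and the `slowdown` factor and scores are invented for illustration.

```python
def ema_update(teacher, student, decay):
    """Return new teacher weights: decay * teacher + (1 - decay) * student."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


def adaptive_decay(base_decay, student_score, teacher_score, slowdown=0.1):
    """Hypothetical assessment-driven decay (not the paper's rule).

    If the student is assessed at least as well as the teacher, use the base
    decay; otherwise shrink the student's contribution by `slowdown`.
    """
    if student_score >= teacher_score:
        return base_decay
    return 1.0 - (1.0 - base_decay) * slowdown


# Toy weights and assessment scores (assumed data).
teacher = [1.0, 2.0]
student = [0.0, 4.0]
decay = adaptive_decay(0.9, student_score=0.5, teacher_score=0.6)
new_teacher = ema_update(teacher, student, decay)
```

In this toy run the student scores worse than the teacher, so the decay rises from 0.9 to about 0.99 and the teacher weights move only slightly toward the student, which limits the damage a poorly performing student can do to the pseudo labels.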